Word-length entropies and correlations of natural language written texts

Authors

  • Maria Kalimeri
  • Vassilios Constantoudis
  • Constantinos Papadimitriou
  • Konstantinos Karamanos
  • Fotis K. Diakonos
  • Harris Papageorgiou
Abstract

We study the frequency distributions and correlations of the word lengths of ten European languages. Our findings indicate that a) the word-length distribution of short words, quantified by the mean value and the entropy, distinguishes the Uralic (Finnish) corpus from the others, b) the tails at long words, manifested in the high-order moments of the distributions, differentiate the Germanic languages (except for English) from the Romance languages and Greek, and c) the correlations between nearby word lengths, measured by comparing the real entropies with those of shuffled texts, are found to be smaller in the case of the Germanic and Finnish languages.
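The shuffled-text comparison in point c) can be sketched in a few lines. This is a minimal illustration, not the authors' actual pipeline: the tokenization (whitespace splitting), the sample text, and the function names are all assumptions made for the example.

```python
import random
from collections import Counter
from math import log2

def shannon_entropy(items):
    """Shannon entropy (bits) of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Toy "text"; a real study would use a large corpus.
words = "the quick brown fox jumps over the lazy dog again and again".split()
lengths = [len(w) for w in words]

# Entropy of single word lengths (order-independent).
h1 = shannon_entropy(lengths)

# Entropy of adjacent length pairs in the real text...
h2_real = shannon_entropy(list(zip(lengths, lengths[1:])))

# ...versus in a shuffled version, which destroys nearby-word correlations.
shuffled = lengths[:]
random.seed(0)
random.shuffle(shuffled)
h2_shuffled = shannon_entropy(list(zip(shuffled, shuffled[1:])))
```

In a sufficiently long correlated text, the pair entropy of the real sequence falls below that of the shuffled sequence; the gap is one way to quantify the correlations the abstract refers to.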


Related articles

Entropy analysis of word-length series of natural language texts: Effects of text language and genre

We estimate the n-gram entropies of natural language texts in word-length representation and find that these are sensitive to text language and genre. We attribute this sensitivity to changes in the probability distribution of the lengths of single words and emphasize the crucial role of the uniformity of probabilities of having words with length between five and ten. Furthermore, comparison wi...


Can Zipf Analyses and Entropy Distinguish between Artificial and Natural Language Texts?

We study statistical properties of natural texts written in English and of two types of artificial texts. As statistical tools we use the conventional and the inverse Zipf analyses, the Shannon entropy and a quantity which is a nonlinear function of the word frequencies, the frequency relative "entropy". Our results obtained by investigating eight complete books and sixteen related artificial te...


Word-Length Correlations and Memory in Large Texts: A Visibility Network Analysis

We study the correlation properties of word lengths in large texts from 30 ebooks in the English language from the Gutenberg Project (www.gutenberg.org) using the natural visibility graph method (NVG). NVG converts a time series into a graph and then analyzes its graph properties. First, the original sequence of words is transformed into a sequence of values containing the length of each word, ...
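The NVG construction mentioned above can be illustrated with a quadratic-time sketch. It assumes the standard natural-visibility criterion (two points are linked if every intermediate point lies strictly below the straight line joining them); the sample sentence and names are illustrative only.

```python
from collections import Counter

def visibility_edges(series):
    """Natural visibility graph edges: node i sees node j if every
    intermediate point k lies strictly below the line through
    (i, series[i]) and (j, series[j])."""
    n = len(series)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            line = lambda k: series[j] + (series[i] - series[j]) * (j - k) / (j - i)
            if all(series[k] < line(k) for k in range(i + 1, j)):
                edges.add((i, j))
    return edges

# Word-length series of a toy sentence, then its visibility graph.
lengths = [len(w) for w in "we study the correlation properties of word lengths".split()]
edges = visibility_edges(lengths)

# Degree sequence of the resulting graph, the kind of graph property
# such analyses go on to examine.
degrees = Counter()
for i, j in edges:
    degrees[i] += 1
    degrees[j] += 1
```

Note that consecutive points are always mutually visible (the intermediate range is empty), so the graph is connected by construction.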


The word entropy of natural languages

The average uncertainty associated with words is an information-theoretic concept at the heart of quantitative and computational linguistics. The entropy has been established as a measure of this average uncertainty, also called average information content. We here use parallel texts of 21 languages to establish the number of tokens at which word entropies converge to stable values. These converg...
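The convergence in question can be illustrated by tracking the plug-in entropy estimate as the sample grows. This is a toy sketch: the repeated sentence merely stands in for a real corpus, and a genuine study would use large parallel texts.

```python
from collections import Counter
from math import log2

def word_entropy(tokens):
    """Plug-in (maximum-likelihood) estimate of word entropy in bits."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Toy corpus: a repeated sentence, so the estimate plateaus quickly;
# real texts converge much more slowly as new word types keep appearing.
words = ("to be or not to be that is the question " * 50).split()

# Entropy estimate at growing prefix sizes; convergence shows up as a plateau.
estimates = [(k, word_entropy(words[:k])) for k in (50, 100, 200, 400, len(words))]
```

Plotting `estimates` against the token count is the usual way to read off the sample size at which the estimate stabilizes.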


Entropy, Transinformation and Word Distribution of Information-Carrying Sequences

We investigate correlations in information carriers, e.g. texts and pieces of music, which are represented by strings of letters. For information carrying strings generated by one source (i.e. a novel or a piece of music) we find correlations on many length scales. The word distribution, the higher order entropies and the transinformation are calculated. The analogy to strings generated through...



Journal:
  • Journal of Quantitative Linguistics

Volume 22, Issue

Pages -

Publication date: 2015